Notebook Description

Looking at the data from Ed's archive, these are the elements that I think are useful for analysis. At the end I have included the tweet this example data came from, as well as a link to the Twitter API description of the tweet object.

Datasets


In [ ]:
ferguson_aug_username_unique.txt

ferguson_data_set_august.csv

ferguson_data_set_small.csv

ferguson_hashtags.csv

ferguson_nov_username_all.txt

ferguson_data_follower_count_gte0.csv

ferguson_data_set.csv

ferguson_data_set_small_user_count_gte100.csv  

ferguson_hashtags_small.csv

User Information:

This is information about the user who posted the tweet in question


In [ ]:
# USERID, USERNAME
t["user"]["id"], t["user"]["screen_name"]
# (352778346, u'LealtadEternaLV')

# LANGUAGE, TIME OFFSET (ENGLISH ONLY, AND US TIMEZONES TO NARROW DOWN TWEETS?)
t["user"]["lang"], t["user"]["utc_offset"]
# (u'es', -16200)

# FOLLOWERS (ACCOUNTS THAT FOLLOW THIS USER), FRIENDS (ACCOUNTS THIS USER FOLLOWS)
t["user"]["followers_count"], t["user"]["friends_count"]
# (3109, 3377)

Tweet Information

Information about the current tweet


In [ ]:
# UID FOR THE TWEET
t["id"]
# 502138116384100352

# TIME TWEET WAS CREATED (SAME AS SENT?)
t["created_at"]
# Wed Aug 20 17:00:30 +0000 2014

# TIMEZONE OFFSET (NECESSARY FOR "NORMALIZING" TIME?)
t["user"]["utc_offset"]
# -16200

# ACTUAL TWEET (I WISH THERE WERE A WAY TO PARSE THIS SO WE ONLY HAD THE TEXT, BUT HASHTAGS ARE OFTEN USED
# AS WORDS IN THE TEXT.  I THINK WE SHOULD AT LEAST GET RID OF "RT" AND LINKS)
t["text"]
# u'RT @RomelBolivar: #VenezuelaPuebloHumanitario Asi Repelen los disturbios en #Ferguson #EEUU Opositor esto tu no lo llamas Represi\xf3n ? http:\u2026'

# ALL HASHTAGS IN THE TEXT
t["entities"]["hashtags"]
# [{u'indices': [18, 45], u'text': u'VenezuelaPuebloHumanitario'},
#  {u'indices': [76, 85], u'text': u'Ferguson'},
#  {u'indices': [86, 91], u'text': u'EEUU'}]

# USERS MENTIONED IN THE TWEET, ID and SCREENNAME
t["entities"]["user_mentions"][0]["id"], t["entities"]["user_mentions"][0]["screen_name"]
# (108731188, u'RomelBolivar')

# NUMBER OF PEOPLE WHO FAVORITED THE CURRENT TWEET, IF RETWEET THIS WILL BE DIFFERENT THAN THE ORIGINAL
t["favorite_count"]
# 34
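
A rough sketch of the cleanup mentioned above (dropping the leading "RT" marker and links while keeping hashtags, since they often act as words). The helper names clean_text and hashtag_texts are made up for this example, and the regular expressions are only an assumption about what counts as a link:

```python
import re

def clean_text(text):
    """Strip a leading 'RT @user:' marker and any links from tweet text."""
    text = re.sub(r"^RT @\w+:\s*", "", text)   # leading retweet marker
    text = re.sub(r"https?:\S*", "", text)     # full or truncated links
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

def hashtag_texts(tweet):
    """Return the hashtag strings listed in the tweet's entities."""
    return [h["text"] for h in tweet["entities"]["hashtags"]]

# Small stand-in for a tweet dict, using values from the example tweet.
example = {
    "text": u"RT @RomelBolivar: #VenezuelaPuebloHumanitario Asi Repelen "
            u"los disturbios en #Ferguson #EEUU http://t.co/8Hc2Upitgk",
    "entities": {"hashtags": [{"text": u"VenezuelaPuebloHumanitario"},
                              {"text": u"Ferguson"},
                              {"text": u"EEUU"}]},
}
print(clean_text(example["text"]))
print(hashtag_texts(example))
```

Hashtags are left in the cleaned text on purpose; the separate hashtag list comes from the entities field rather than from parsing the text.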

Retweets

If the tweet in question is actually a retweet, it will come along with a set of information about the original tweet that the current user is retweeting.

General info about the original tweet currently being retweeted:


In [ ]:
# RETWEETED? (THERE IS A "retweeted" field but it shows False despite the count being greater than zero.
# That field records whether the authenticating user has retweeted this tweet, not whether the tweet has
# been retweeted. In the example tweet the count also matches the original tweet's count (109).
# Anyway, hence the check if there is a retweet_count.)
bool(t["retweet_count"])
# True

# NUMBER OF RETWEETS OF THE CURRENT TWEET
t["retweet_count"]
# 109

# ID OF THE ORIGINAL TWEET BEING RETWEETED (FOR NETWORK ANALYSIS?)
t["retweeted_status"]["id"]
# 502098277094129664

# DATE TIME ORIGINAL TWEET WAS CREATED
t["retweeted_status"]["created_at"]
# u'Wed Aug 20 14:22:12 +0000 2014'

# NUMBER OF FAVORITES OF THE ORIGINAL TWEET
t["retweeted_status"]["favorite_count"]
# 10

# NUMBER OF TWEETS THE ORIGINAL AUTHOR HAS FAVORITED (A USER-LEVEL FIELD, NOT FAVORITES OF THIS TWEET)
t["retweeted_status"]["user"]["favourites_count"]
# 3479

User information about the original author of the current tweet being retweeted:

In [ ]:
# SCREEN NAME OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["screen_name"]
# u'RomelBolivar'

# USER ID OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["id"]
# 108731188

# TOTAL FOLLOWERS OF ORIGINAL TWEET USER
t["retweeted_status"]["user"]["followers_count"]
# 59545

# TOTAL FRIENDS OF ORIGINAL TWEET USER
t["retweeted_status"]["user"]["friends_count"]
# 59222

# TIMEZONE OF ORIGINAL USER OF RETWEETED TWEET
t["retweeted_status"]["user"]["utc_offset"]
# -16200

Tweet and user info if the current tweet being retweeted was originally a reply:

In [ ]:
# SCREEN NAME OF THE USER THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_screen_name"]
# u'ReplyToUsername'

# USER ID OF THE USER THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_user_id"]
# 837103894

# TWEET ID OF THE TWEET THE RETWEET WAS ORIGINALLY REPLYING TO
t["retweeted_status"]["in_reply_to_status_id"]
# 515477293597193041

Replies

If the current tweet is a reply to another tweet:


In [ ]:
# SCREEN NAME OF THE PERSON THE TWEET IS REPLYING TO
t["in_reply_to_screen_name"]
# u'HassanTheeOne'

# USER ID OF THE PERSON THE TWEET IS REPLYING TO
t["in_reply_to_user_id"]
# 23497553

# TWEET ID OF THE ORIGINAL TWEET THE CURRENT TWEET IS REPLYING TO
t["in_reply_to_status_id"]
# 502137293596262401
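
Putting the retweet and reply fields together, here is a minimal sketch of how a tweet could be labelled as a retweet, a reply, or an original. The function name tweet_kind and the rule that a retweet of a reply counts as a retweet are my own assumptions, not something the archive dictates:

```python
def tweet_kind(tweet):
    """Label a tweet dict as 'retweet', 'reply', or 'original'."""
    # A retweet carries the original tweet under "retweeted_status".
    if "retweeted_status" in tweet:
        return "retweet"
    # A reply has a non-null target tweet id and/or target user id.
    if (tweet.get("in_reply_to_status_id") is not None
            or tweet.get("in_reply_to_user_id") is not None):
        return "reply"
    return "original"

print(tweet_kind({"retweeted_status": {"id": 502098277094129664}}))
print(tweet_kind({"in_reply_to_status_id": 502137293596262401,
                  "in_reply_to_user_id": 23497553}))
print(tweet_kind({"in_reply_to_status_id": None, "in_reply_to_user_id": None}))
```

Checking for the retweeted_status key avoids relying on retweet_count, which can be non-zero on tweets that are not themselves retweets.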

Example Tweet

Note: in some cases you might find that this example tweet does not have data for items that do have data above. This is because, in some cases, this example tweet didn't have any data to make an example of (say, it wasn't in reply to anything). Feel free to pull data from above and fill it in so that this example is more complete.


In [ ]:
{u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Wed Aug 20 17:00:30 +0000 2014',
 u'entities': {u'hashtags': [{u'indices': [18, 45],
    u'text': u'VenezuelaPuebloHumanitario'},
   {u'indices': [76, 85], u'text': u'Ferguson'},
   {u'indices': [86, 91], u'text': u'EEUU'}],
  u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
    u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
    u'id': 502098274686623744,
    u'id_str': u'502098274686623744',
    u'indices': [139, 140],
    u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
    u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
    u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
     u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
     u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
     u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
    u'source_status_id': 502098277094129664,
    u'source_status_id_str': u'502098277094129664',
    u'type': u'photo',
    u'url': u'http://t.co/8Hc2Upitgk'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': [{u'id': 108731188,
    u'id_str': u'108731188',
    u'indices': [3, 16],
    u'name': u'Romel Bol\xedvar ',
    u'screen_name': u'RomelBolivar'}]},
 u'extended_entities': {u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
    u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
    u'id': 502098274686623744,
    u'id_str': u'502098274686623744',
    u'indices': [139, 140],
    u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
    u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
    u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
     u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
     u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
     u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
    u'source_status_id': 502098277094129664,
    u'source_status_id_str': u'502098277094129664',
    u'type': u'photo',
    u'url': u'http://t.co/8Hc2Upitgk'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'id': 502138116384100352,
 u'id_str': u'502138116384100352',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'es',
 u'place': None,
 u'possibly_sensitive': False,
 u'retweet_count': 109,
 u'retweeted': False,
 u'retweeted_status': {u'contributors': None,
  u'coordinates': None,
  u'created_at': u'Wed Aug 20 14:22:12 +0000 2014',
  u'entities': {u'hashtags': [{u'indices': [0, 27],
     u'text': u'VenezuelaPuebloHumanitario'},
    {u'indices': [58, 67], u'text': u'Ferguson'},
    {u'indices': [68, 73], u'text': u'EEUU'}],
   u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
     u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
     u'id': 502098274686623744,
     u'id_str': u'502098274686623744',
     u'indices': [116, 138],
     u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
     u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
     u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
      u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
      u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
      u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
     u'type': u'photo',
     u'url': u'http://t.co/8Hc2Upitgk'}],
   u'symbols': [],
   u'urls': [],
   u'user_mentions': []},
  u'extended_entities': {u'media': [{u'display_url': u'pic.twitter.com/8Hc2Upitgk',
     u'expanded_url': u'http://twitter.com/RomelBolivar/status/502098277094129664/photo/1',
     u'id': 502098274686623744,
     u'id_str': u'502098274686623744',
     u'indices': [116, 138],
     u'media_url': u'http://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
     u'media_url_https': u'https://pbs.twimg.com/media/BvfPuHkIYAAmtix.jpg',
     u'sizes': {u'large': {u'h': 339, u'resize': u'fit', u'w': 600},
      u'medium': {u'h': 338, u'resize': u'fit', u'w': 600},
      u'small': {u'h': 192, u'resize': u'fit', u'w': 340},
      u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}},
     u'type': u'photo',
     u'url': u'http://t.co/8Hc2Upitgk'}]},
  u'favorite_count': 10,
  u'favorited': False,
  u'geo': None,
  u'id': 502098277094129664,
  u'id_str': u'502098277094129664',
  u'in_reply_to_screen_name': None,
  u'in_reply_to_status_id': None,
  u'in_reply_to_status_id_str': None,
  u'in_reply_to_user_id': None,
  u'in_reply_to_user_id_str': None,
  u'lang': u'es',
  u'place': None,
  u'possibly_sensitive': False,
  u'retweet_count': 109,
  u'retweeted': False,
  u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
  u'text': u'#VenezuelaPuebloHumanitario Asi Repelen los disturbios en #Ferguson #EEUU Opositor esto tu no lo llamas Represi\xf3n ? http://t.co/8Hc2Upitgk',
  u'truncated': False,
  u'user': {u'contributors_enabled': False,
   u'created_at': u'Tue Jan 26 22:04:03 +0000 2010',
   u'default_profile': False,
   u'default_profile_image': False,
   u'description': u'Revolucionario, Humanista, Antiimperialista ... UNIDAD,LUCHA,BATALLA Y VICTORIA .......ALERTA SIEMPRE',
   u'entities': {u'description': {u'urls': []},
    u'url': {u'urls': [{u'display_url': u'contactoconlarealidad.com',
       u'expanded_url': u'http://www.contactoconlarealidad.com/',
       u'indices': [0, 22],
       u'url': u'http://t.co/pvIVI3EBEH'}]}},
   u'favourites_count': 3479,
   u'follow_request_sent': False,
   u'followers_count': 59545,
   u'following': False,
   u'friends_count': 59222,
   u'geo_enabled': True,
   u'id': 108731188,
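   u'id_str': u'108731188',
   # ... the rest of the original author's user object and of retweeted_status is missing from this dump
   }},
 # ... the outer tweet's remaining top-level fields (e.g. u'text') are missing from this dump
 u'user': {
  # the fields below belong to the retweeting user (LealtadEternaLV); the leading fields are missing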
  u'location': u'Caracas venezuela',
  u'name': u'Lealtad a Ch\xe1vez!',
  u'notifications': False,
  u'profile_background_color': u'177FED',
  u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/466810585368125440/g_3AJDX7.png',
  u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/466810585368125440/g_3AJDX7.png',
  u'profile_background_tile': True,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/352778346/1412524332',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/518790606118600704/uvucm1rX_normal.jpeg',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/518790606118600704/uvucm1rX_normal.jpeg',
  u'profile_link_color': u'ABB8C2',
  u'profile_location': None,
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'EFEFEF',
  u'profile_text_color': u'333333',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'LealtadEternaLV',
  u'statuses_count': 8564,
  u'time_zone': u'Caracas',
  u'url': None,
  u'utc_offset': -16200,
  u'verified': False}}

User Collection

Since our predictions will be on a user basis, we'll need to aggregate the data across the tweets per user.

BASIC USER INFORMATION:

uid: user id of the author of the group of tweets

screen_name: username of the author of the group of tweets

user_time: the length of time the user has been a member of Twitter
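
As a sketch of how user_time could be derived, the created_at strings in the example tweet can be parsed with the standard library. The helper names below are hypothetical, measuring membership up to the time of the tweet (rather than up to today) is an assumption, and the format string assumes every timestamp is in UTC ("+0000") as in the example:

```python
from datetime import datetime, timedelta

# Format of created_at in the example tweet, e.g. "Wed Aug 20 17:00:30 +0000 2014".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def parse_twitter_time(value):
    """Parse a created_at string into a naive UTC datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)

def account_age_days(tweet):
    """Days between account creation and the moment the tweet was sent."""
    joined = parse_twitter_time(tweet["user"]["created_at"])
    sent = parse_twitter_time(tweet["created_at"])
    return (sent - joined).days

def local_time(tweet):
    """Shift the tweet time by the user's utc_offset (seconds); None counts as 0."""
    offset = tweet["user"].get("utc_offset") or 0
    return parse_twitter_time(tweet["created_at"]) + timedelta(seconds=offset)

# Values taken from the retweeted_status part of the example tweet.
example = {
    "created_at": "Wed Aug 20 14:22:12 +0000 2014",
    "user": {"created_at": "Tue Jan 26 22:04:03 +0000 2010", "utc_offset": -16200},
}
print(account_age_days(example))
print(local_time(example))
```

The same parser would also cover the "normalizing" question for utc_offset: the offset is in seconds, so it can be added directly as a timedelta.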


USER TWEET ACTIVITY:

total_tweets: sum of any tweets, retweets, replies sent by a given user, not including retweets of their tweets.

total_orig: sum of tweets sent by this user that are originally produced by them, or a reply.

total_retweets: sum of tweets sent by this user that were retweets

total_replies: sum of all tweets sent by this user that were replies to someone else

Rates: you could do rates for different tweets over time (total_tweets/some_period). Maybe sustained activity is indicative of something, as opposed to a large volume that took place in a single day. Do we take these rates over the entire time period, or only over the times the user was active?
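
The activity counts above could be gathered in a single pass over the tweets. This is only a sketch: activity_by_user is a made-up name, tweets is assumed to be a list of tweet dicts like the example tweet, and a retweet of a reply is counted only as a retweet:

```python
from collections import defaultdict

def activity_by_user(tweets):
    """Count tweets, originals, retweets and replies sent by each user id."""
    stats = defaultdict(lambda: {"total_tweets": 0, "total_orig": 0,
                                 "total_retweets": 0, "total_replies": 0})
    for t in tweets:
        s = stats[t["user"]["id"]]
        s["total_tweets"] += 1
        is_reply = (t.get("in_reply_to_status_id") is not None
                    or t.get("in_reply_to_user_id") is not None)
        if "retweeted_status" in t:
            s["total_retweets"] += 1
        else:
            # total_orig covers the user's own tweets, including replies.
            s["total_orig"] += 1
            if is_reply:
                s["total_replies"] += 1
    return dict(stats)

sample = [
    {"user": {"id": 1}},
    {"user": {"id": 1}, "retweeted_status": {"id": 10}},
    {"user": {"id": 1}, "in_reply_to_status_id": 5, "in_reply_to_user_id": 2},
    {"user": {"id": 2}},
]
print(activity_by_user(sample))
```

A rate would then be one of these totals divided by whichever period is chosen, which is the open question raised above.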


USER TWEET QUALITY:

total_retweeted: sum of all of the user's tweets that were retweeted. This would involve going through all of the tweets in the entire sample, looking at retweeted_status: user: screen_name or id, and keeping a count for each user. Or, group retweets by the id of the tweet being retweeted, then take the max retweet_count.

total_mentioned: will need to look at entities>>user_mentions of all tweets. For each user mention, update a count of mentions from a list of users.

total_replied: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_user_id. For each reply, update a count of replies from a list of users.

total_favorites: take a look at all tweets that have in_reply_to_screen_name or in_reply_to_user_id. For each reply, update a count of favorites from a list of users. Or, take all tweets that have the aforementioned qualities, group them by tweet and date, then take the count from the last time the tweet appeared.

percent_positive: percent of user's August tweets that are classified as positive.
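
A sketch of the counting described for total_retweeted, total_mentioned and total_replied, done in one pass with counters keyed by user id. The function name quality_counts is hypothetical, and tweets is assumed to be a list of tweet dicts shaped like the example tweet:

```python
from collections import Counter

def quality_counts(tweets):
    """Count how often each user id is retweeted, mentioned, and replied to."""
    retweeted, mentioned, replied = Counter(), Counter(), Counter()
    for t in tweets:
        original = t.get("retweeted_status")
        if original is not None:
            # Credit the author of the tweet being retweeted.
            retweeted[original["user"]["id"]] += 1
        for m in t.get("entities", {}).get("user_mentions", []):
            mentioned[m["id"]] += 1
        if t.get("in_reply_to_user_id") is not None:
            replied[t["in_reply_to_user_id"]] += 1
    return retweeted, mentioned, replied

sample = [
    {"user": {"id": 1}, "retweeted_status": {"user": {"id": 9}},
     "entities": {"user_mentions": [{"id": 9}]}},
    {"user": {"id": 2}, "in_reply_to_user_id": 9,
     "entities": {"user_mentions": [{"id": 9}, {"id": 3}]}},
]
retweeted, mentioned, replied = quality_counts(sample)
print(retweeted, mentioned, replied)
```

One thing to watch: in the example tweet the retweet's user_mentions lists the original author (from the "RT @RomelBolivar:" prefix), so retweets also add to the mention counts unless they are filtered out.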


USER NETWORK:

ferg_followers: list of ids of those that follow the current user

ferg_friends: list of ids of those that are friends of the current user

ferg_followers_count: number of ids in ferg_followers

ferg_friends_count: number of ids in ferg_friends

ferg_friend_follower_ratio: ferg_followers_count/ferg_friends_count (what type of ties does one prefer to keep)
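
The ratio above needs a guard for users with no friends. A small sketch, where the function name and the choice to return None for a zero friend count are assumptions:

```python
def friend_follower_ratio(followers_count, friends_count):
    """Followers divided by friends, or None when the user has no friends."""
    if friends_count == 0:
        return None
    return followers_count / float(friends_count)

# Follower and friend counts from the example tweet's two users.
print(friend_follower_ratio(3109, 3377))
print(friend_follower_ratio(59545, 59222))
```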


Other possibilities:

followers/friends_growth: this would be a rate of growth of friends/followers from the start until the end of the event. A measure of the "gravity well" a user has for attracting followers. This would require looking at the tweets of each user by date, grabbing the follower count at the first tweet and the last tweet, and then doing some simple arithmetic.
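
The first-tweet versus last-tweet arithmetic could look roughly like this. follower_growth is a made-up name, tweets is assumed to be a list of tweet dicts with created_at in the format of the example tweet, and the result is a raw difference rather than a rate:

```python
from datetime import datetime

def follower_growth(tweets):
    """Follower count at each user's last tweet minus the count at their first."""
    observations = {}
    for t in tweets:
        when = datetime.strptime(t["created_at"], "%a %b %d %H:%M:%S +0000 %Y")
        observations.setdefault(t["user"]["id"], []).append(
            (when, t["user"]["followers_count"]))
    growth = {}
    for uid, obs in observations.items():
        obs.sort()  # earliest observation first
        growth[uid] = obs[-1][1] - obs[0][1]
    return growth

# Made-up follower counts for one user at two points in time.
sample = [
    {"created_at": "Wed Aug 20 17:00:30 +0000 2014",
     "user": {"id": 1, "followers_count": 120}},
    {"created_at": "Sat Aug 09 12:00:00 +0000 2014",
     "user": {"id": 1, "followers_count": 100}},
]
print(follower_growth(sample))
```

Dividing the difference by the time between the two observations would turn it into a rate; swapping followers_count for friends_count gives the friends version.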

Another possibility is that we can have all of those things for this user beyond just this event: total friends and followers on Twitter in general. For this prototypical analysis and the limited resources, I think it's more useful to have them for just the Ferguson topic.


In [1]:
# user document: each value describes what the field should hold
user_document = {
    "user": {
        "uid": "user id of the author of the group of tweets",
        "screen_name": "username of the author of the group of tweets",
        "joined": "date of account creation, user.created_at from tweet",
        "user_time": "the length of time the user has been a member of Twitter",
    },
    "tweets": {
        "activity": {
            "total_tweets": "sum of any tweets, retweets, replies sent by a given user, "
                            "not including retweets of their tweets",
            "total_orig": "sum of tweets sent by this user that are originally produced "
                          "by them, or a reply",
            "total_retweets": "sum of tweets sent by this user that were retweets",
            "total_replies": "sum of all tweets sent by this user that were replies to "
                             "someone else",
            "rates": "you could do rates for different tweets over time "
                     "(total_tweets/some_period). Maybe sustained activity is indicative "
                     "of something, as opposed to a large volume that took place in a "
                     "single day. Do we take these rates over the entire time period, or "
                     "only over the times the user was active?",
            "tweets": "all tweets, or tweet IDs? would be nice to have, unless it could "
                      "be an overlapping index",
        },
        "quality": {
            "total_retweeted": "sum of all of the user's tweets that were retweeted. This "
                               "would involve going through all of the tweets in the "
                               "entire sample, looking at retweeted_status: user: "
                               "screen_name or id, and keeping a count for each user. Or, "
                               "group retweets by the id of the tweet being retweeted, "
                               "then take the max retweet_count",
            "total_mentioned": "will need to look at entities>>user_mentions of all "
                               "tweets. For each user mention, update a count of mentions "
                               "from a list of users",
            "total_replied": "take a look at all tweets that have in_reply_to_screen_name "
                             "or in_reply_to_user_id. For each reply, update a count of "
                             "replies from a list of users",
            "total_favorites": "take a look at all tweets that have "
                               "in_reply_to_screen_name or in_reply_to_user_id. For each "
                               "reply, update a count of favorites from a list of users. "
                               "Or, take all tweets that have the aforementioned "
                               "qualities, group them by tweet and date, then take the "
                               "count from the last time the tweet appeared",
            "percent_positive": "percent of user's August tweets that are classified as "
                                "positive",
        },
    },
    "network": {
        "ferg_followers": "list of ids of those that follow the current user",
        "ferg_friends": "list of ids of those that are friends of the current user",
        "ferg_followers_count": "number of ids in ferg_followers",
        "ferg_friends_count": "number of ids in ferg_friends",
        "ferg_friend_follower_ratio": "ferg_followers_count/ferg_friends_count "
                                      "(what type of ties does one prefer to keep)",
    },
}
